Abstract

We consider the task of opportunistic channel access in a primary system composed of independent
Gilbert-Elliot channels, where the secondary (or opportunistic) user has no a priori information
regarding the statistical characteristics of the system. We show that this problem may be cast
into the framework of model-based learning in a specific class of Partially Observed Markov Decision
Processes (POMDPs), for which we introduce an algorithm aimed at striking an optimal tradeoff between
the exploration (or estimation) and exploitation requirements. We provide finite-horizon regret bounds
for this algorithm, as well as a numerical evaluation of its performance in the single channel model and
in the case of stochastically identical channels.

1 Introduction 

In recent years, opportunistic spectrum access for cognitive radio has been the focus of significant research
efforts [TJ [5J UOj . These works propose to improve spectral efficiency by making smarter use of the large
portion of the frequency bands that remains unused. In Licensed Band Cognitive Radio, the goal is to share
the bands licensed to primary users with non-primary users, called secondary users or cognitive users. These
secondary users must carefully identify available spectrum resources and communicate without disturbing
the primary network. Opportunistic spectrum access thus has the potential to significantly increase the
spectral efficiency of wireless networks.

In this paper, we focus on the opportunistic communication model previously considered by [HE], which
consists of N channels in which a single secondary user searches for idle channels temporarily unused by
primary users. The N channels are modeled as Gilbert-Elliot channels: at each time slot, a channel is either
idle or occupied, and the availability of the channel evolves in a Markovian way. Assuming that the secondary
user can only sense $M \ll N$ channels simultaneously [SJ [5J [TC] , its main task is to choose which channels
to sense at each time so as to maximize its expected long-term transmission efficiency. Under this model,
channel allocation may be interpreted as a planning task in a particular class of Partially Observed Markov
Decision Processes (POMDPs), also called restless bandits [51 117).

In the works of [5J[T21[T7], it is assumed that the statistical information about the primary users' traffic
is fully available to the secondary user. In practice, however, the statistical characteristics of the traffic are
not fixed a priori and must be estimated by the secondary user. As the secondary user selects
channels to sense, we are not faced with a simple parameter estimation problem but with a task which is
closer to reinforcement learning [13]. We consider scenarios in which the secondary user first carries out
an exploration phase, in which the statistical information regarding the model is gathered, followed
by an exploitation phase, in which the optimal sensing policy, based on the estimated parameters, is applied.
The key issue is to reach the proper balance between exploration and exploitation. This issue has been
considered before by [5], who proposed an asymptotic rule to set the length of the exploration phase but
without a precise evaluation of the performance of this approach. Lai et al. [6] also considered this problem
in the multiple secondary users case, but in a simpler model where each channel is modeled as an independent
and identically distributed source. In the field of reinforcement learning, this class of problems is known
as model-based reinforcement learning, for which several approaches have been proposed recently [2l I12[ [M].
However, none of these directly applies to the channel allocation model, in which the state of the channels is
only partially observed.

Our contribution consists in proposing a strategy, termed the Tiling Algorithm, for adaptively setting the
length of the exploration phase. Under this strategy, the length of the exploration phase is not fixed
beforehand, and the exploration phase is terminated as soon as enough statistical evidence has been
accumulated to determine the optimal sensing policy. The distinctive feature of this approach is that it comes
with strong performance guarantees in the form of finite-horizon regret bounds. For the sake of clarity, this
strategy is described in the general abstract framework of parametric POMDPs. Note that the channel
access model corresponds to a specific example of POMDP parameterized by the transition probabilities of
the availability of each channel. As the approach relies on the restrictive assumption that, for each possible
parameter value, the solution of the planning problem is fully known, it is not applicable to POMDPs at
large but is well suited to the case of the channel allocation model. We provide a detailed account of the
use of the approach for two simple instances of the opportunistic channel access model, including the case
of stochastically identical channels considered by [TB] .

The article is organized as follows. The channel allocation model is formally described in Section 2. In
Section 3, the tiling algorithm is presented and its performance in terms of finite-horizon regret bounds is
obtained. The application to opportunistic channel access is detailed in Section 4, both in the one channel
model and in the case of stochastically identical channels.

2 Channel Access Model 

Consider a network consisting of N independent channels with time-varying state, with bandwidths B(i),
for i = 1, . . . , N. These N channels are licensed to a primary network whose users communicate according to
a synchronous slot structure. At each time slot, channels are either free or occupied (see Fig. 1). Consider
now a secondary user seeking opportunities to transmit in the free slots of these N channels without
disturbing the primary network. With limited sensing, a secondary user can only access a subset of $M \ll N$
channels. The aim of the secondary user is to leverage this partial observation of the channels so as to
maximize its long-term opportunities of transmission.

Figure 1: Representation of the primary network

Figure 2: Transition probabilities in the i-th channel.

Introduce the state vector which describes the network at time $t$, $[X_t(1), \dots, X_t(N)]'$, where $X_t(i)$ is
equal to 0 when channel $i$ is occupied and 1 when the channel is idle. The states $X_t(i)$ and $X_t(j)$ of
different channels $i \neq j$ are assumed to be independent. Let $\alpha(i)$ (resp. $\beta(i)$) be the transition probability
from state 0 (resp. 1) to state 1 in channel $i$ (see Fig. 2). Additionally, denote by $(\nu_0(i), \nu_1(i))$ the stationary
probability of the Markov chain $(X_t(i))_t$. The secondary user selects a set of $M$ channels to sense. This
choice corresponds to an action $A_t = [A_t(1), \dots, A_t(N)]'$, where $A_t(i) = 1$ if the i-th channel is sensed
and $A_t(i) = 0$ otherwise. Since only $M$ channels can be sensed, $\sum_{i=1}^{N} A_t(i) = M$. The observation is an
$N$-dimensional vector $[Y_t(1), \dots, Y_t(N)]'$ such that $Y_t(i) = X_t(i)$ for the $M$ selected channels and $Y_t(i)$ is
an arbitrary value not in $\{0, 1\}$ for the other channels. The reward gained at each time slot is equal to
the aggregated bandwidth available. In addition, a reward equal to $\lambda$, with $0 \le \lambda \le \min_i B(i)$, is received for each
unobserved channel. At each time $t$, the received reward is $\sum_{i=1}^{N} r(X_t(i), A_t(i))$, where

$$ r(X_t(i), A_t(i)) = \begin{cases} B(i) & \text{if } A_t(i) = 1,\ X_t(i) = Y_t(i) = 1, \\ 0 & \text{if } A_t(i) = 1,\ X_t(i) = Y_t(i) = 0, \\ \lambda & \text{otherwise,} \end{cases} $$

which depends on $X_t(i)$ only through $Y_t(i)$. The gain $\lambda$ associated with the action of not observing may also
be interpreted as a penalty for sensing occupied channels. Indeed, this model is equivalent to the one where
a positive reward $B(i) - \lambda$ is received for available sensed channels, a penalty $-\lambda$ is received for occupied
sensed channels and no reward is received for non-sensed channels.

Note that this model is a particular POMDP in which the state transition probabilities do not depend
on the actions. Moreover, the independence between the channels may be exploited to construct an
$N$-dimensional sufficient internal state which summarizes all past decisions and observations. The internal
state $p_t$ is defined as follows: for all $i \in \{1, \dots, N\}$, $p_t(i) = P[X_t(i) = 1 \mid A_{0:t-1}(i), Y_{0:t-1}(i)]$. This internal
state enables the secondary user to select the channels to sense. The internal state recursion is

$$ p_{t+1}(i) = \begin{cases} \alpha(i) & \text{if } A_t(i) = 1,\ Y_t(i) = 0, \\ \beta(i) & \text{if } A_t(i) = 1,\ Y_t(i) = 1, \\ p_t(i)\beta(i) + (1 - p_t(i))\alpha(i) & \text{otherwise.} \end{cases} \tag{1} $$



Moreover, note that at each time $t$, the internal state $p_t$ is completely defined by the pair $(k, y)$, where
$y = [y(1), \dots, y(N)]'$ denotes the last observed state for each channel and $k = [k(1), \dots, k(N)]'$ is the duration
during which the corresponding channel has not been observed. Denote by $p^{k(i),y(i)}_{\alpha(i),\beta(i)}$ the probability that a
channel is free given that it has not been observed for $k(i)$ time slots and that the last observation was $y(i)$.
That is to say, for $k(i) > 1$, $p^{k(i),y(i)}_{\alpha(i),\beta(i)} = P[X_t(i) = 1 \mid A_{t-k(i)+1:t-1}(i) = 0, A_{t-k(i)}(i) = 1, Y_{t-k(i)}(i) = y(i)]$
and $p^{1,y(i)}_{\alpha(i),\beta(i)} = P[X_t(i) = 1 \mid A_{t-1}(i) = 1, Y_{t-1}(i) = y(i)]$. Using equation (1), these probabilities may be
written as follows:

$$ p^{k(i),0}_{\alpha(i),\beta(i)} = \frac{\alpha(i)\left(1 - (\beta(i) - \alpha(i))^{k(i)}\right)}{1 - \beta(i) + \alpha(i)}, \qquad p^{k(i),1}_{\alpha(i),\beta(i)} = \frac{(\beta(i) - \alpha(i))^{k(i)}(1 - \beta(i)) + \alpha(i)}{1 - \beta(i) + \alpha(i)}. \tag{2} $$
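The relation between the recursion (1) and the closed form (2) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, it treats a single channel, and it assumes $1 - \beta + \alpha \neq 0$ so that the closed form is well defined.

```python
def belief_update(p, alpha, beta, sensed, obs=None):
    """One step of the internal-state recursion (1) for a single channel."""
    if sensed:
        # After sensing, the belief is reset to the one-step transition probability.
        return beta if obs == 1 else alpha
    # Not sensed: propagate the belief through the two-state Markov chain.
    return p * beta + (1.0 - p) * alpha


def belief_closed_form(k, y, alpha, beta):
    """Closed form (2): probability of being idle k slots after observing y."""
    rho = beta - alpha
    denom = 1.0 - beta + alpha  # assumed non-zero
    if y == 0:
        return alpha * (1.0 - rho ** k) / denom
    return (rho ** k * (1.0 - beta) + alpha) / denom


def belief_by_iteration(k, y, alpha, beta):
    """Same quantity, obtained by iterating (1): sense once, then wait k - 1 slots."""
    p = belief_update(None, alpha, beta, sensed=True, obs=y)
    for _ in range(k - 1):
        p = belief_update(p, alpha, beta, sensed=False)
    return p
```

For instance, `belief_closed_form(1, 0, 0.3, 0.8)` reduces to $\alpha = 0.3$, and for larger $k$ both functions approach the stationary probability $\alpha / (1 - \beta + \alpha)$.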

The channel allocation model may also be interpreted as an instance of the restless multi-armed bandit
framework introduced by [15]. Papadimitriou and Tsitsiklis have established that the planning task in
the restless bandit model is PSPACE-hard, and hence that optimal planning is not practically achievable
when the number $N$ of channels becomes large. Nevertheless, recent works have focused on near-optimal
so-called index strategies [71 [H [5], which have a reduced implementation cost. An index strategy consists in
separating the optimization task into $N$ channel-specific sub-problems, following the idea originally proposed
by Whittle [15]. Interestingly, to determine the Whittle index pertaining to each channel, one has to solve
the planning problem in the single channel model for arbitrary values of $\lambda$. Using this interpretation, explicit
expressions of the Whittle indices as a function of the channel transition probabilities $\{\alpha(i), \beta(i)\}_{i=1,\dots,N}$
have been provided by [7J |H].

3 The Tiling Algorithm 

Here, we focus on determining the sensing policy when the secondary user does not have any statistical
information about the primary users' traffic. A common approach is to learn the transition probabilities
$\{\alpha(i), \beta(i)\}_{i=1,\dots,N}$ in a first phase and then to act optimally according to the estimated model. If the
learning phase is sufficiently long, the estimates of the probabilities can be quite precise and there is a higher
chance that the policy followed during the exploitation phase is indeed the optimal policy. On the other
hand, blindly sensing channels to learn the model parameters does not necessarily coincide with the optimal
policy and thus has a cost in terms of performance. The question is hence: how long should the secondary
user learn the model (explore) before applying an exploitation policy such as Whittle's policy?

This problem is the well-known dilemma between exploration and exploitation [13]. Here we propose an
algorithm to balance exploration and exploitation by adaptively monitoring the duration of the exploration
phase. We present this algorithm in a more abstract framework for generality. We assume that the optimal
policy is a known function of a low-dimensional parameter. This condition can be restrictive, but it is satisfied
in simple cases such as finite state space MDPs or in particular cases of POMDPs like the channel access
model (see also Section 4).

3.1 The Parametric POMDP Model 

Consider a POMDP defined by $(X, A, Y, Q_\theta, f, r)$, where $X$ is the discrete state space, $Y$ is the observation
space, $A$ is the finite set of actions, $Q_\theta : X \times A \times X \to [0, 1]$ is the transition probability, $f : X \times A \to Y$ is
the observation function, $r : X \times A \to \mathbb{R}$ is the bounded reward function and $\theta \in \Theta$ denotes an unknown
parameter. Given the current hidden state $x \in X$ of the system and a control action $a \in A$, the probability of
the next state $x' \in X$ is given by $Q_\theta(x, a; x')$. At each time step $t$, one chooses an action $A_t = \pi(A_{0:t-1}, Y_{0:t-1})$
according to a policy $\pi$, and hence observes $Y_t = f(X_t, A_t)$ and receives the reward $r(X_t, A_t)$. Without loss
of generality, we assume that for all $x \in X$, for all $a \in A$, $r(x, a) \le 1$.

Since we are interested in rewards accumulated over finite but large horizons, we will consider the average
(or long-term) reward criterion defined by

$$ V_\theta^\pi = \lim_{n \to \infty} \frac{1}{n} E_\theta^\pi \left[ \sum_{t=1}^{n} r(X_t, A_t) \right], \tag{3} $$

where $\pi$ denotes a fixed policy. The notation $V_\theta^\pi$ is meant to highlight the fact that the average reward
depends on both the policy $\pi$ and the actual parameter value $\theta$. For a given parameter value, the optimal
long-term reward is defined as $V_\theta^* = \sup_\pi V_\theta^\pi$, and $\pi_\theta^*$ denotes the associated optimal policy. We assume that
the dependence of $V_\theta^*$ and $\pi_\theta^*$ with respect to $\theta$ is fully known. In addition, there exists a particular default
policy $\pi_0$ under which the parameter $\theta$ can be consistently estimated.

Given the above, one can partition the parameter space into non-intersecting subsets, $\Theta = \bigcup_i Z_i$, such
that each policy zone $Z_i$ corresponds to a single optimal policy, which we denote by $\pi_i^*$. In other words, for
any $\theta \in Z_i$, $V_\theta^* = V_\theta^{\pi_i^*}$. In each policy zone $Z_i$, the corresponding optimal policy $\pi_i^*$ is assumed to be known,
as well as the long-term reward function $V_\theta^{\pi_i^*}$ for any $\theta \in \Theta$.



3.2 The Tiling Algorithm (TA)

We denote by $\hat\theta_t$ the parameter estimate obtained after $t$ steps of the exploration policy and by $\Delta_t$ the
associated confidence region, whose construction will be made more precise below. The principle of the
tiling algorithm is to use the policy zones $(Z_i)_i$ to determine the length of the exploration phase: basically,
the exploration phase will last until the estimated confidence region $\Delta_t$ fully enters one of the policy zones. It
turns out, however, that this naive principle does not allow for a sufficient control of the expected duration of
the exploration phase and, hence, of the algorithm's regret. In order to deal with parameter values located
close to the borders of policy zones, one needs to introduce additional frontier zones $(F_j(n))_j$ that will shrink
at a suitable rate with the time horizon $n$. Let

$$ T_n = \inf\{t \ge 1 : \exists i,\ \Delta_t \subset Z_i \ \text{or}\ \exists j,\ \Delta_t \subset F_j(n)\} \tag{4} $$

denote the random instant where the exploration terminates. Note that the frontier zones $(F_j(n))_j$ depend
on $n$. Indeed, the larger $n$, the smaller the frontier zones can be in order to balance the length of the
exploration phase and the loss due to the possible choice of a suboptimal policy.




Figure 3: Tiling of the parameter space for an example with three distinct optimal policy zones. 

In Figure 3, we represent the tiling of the parameter space for a hypothetical example with three distinct
optimal policy zones. In this case, there are four frontier zones: one between each pair of policy zones ($F_1(n)$,
$F_2(n)$ and $F_3(n)$) and another ($F_4(n)$) for the intersection of all the policy zones. In the following, we shall
assume that there exist only finitely many distinct frontier and policy zones.

The tiling algorithm consists in using the default exploratory policy $\pi_0$ until the occurrence of the stopping
time $T_n$, according to (4). From $T_n$ onward, the algorithm then selects a policy to use during the remaining
time as follows: if, at the end of the exploration phase, the confidence region is fully included in a policy
zone $Z_i$, then the selected policy is $\pi_i^*$; otherwise, the confidence region is included in a frontier zone $F_j(n)$
and the selected policy is any optimal policy $\pi_i^*$ compatible with the frontier zone $F_j(n)$. An optimal policy
$\pi_i^*$ is said to be compatible with the frontier zone $F_j(n)$ if the intersection between the policy zone $Z_i$ and
the frontier zone is non-empty. In the example of Figure 3, for instance, the optimal policies of the two policy
zones adjacent to $F_1(n)$ are compatible with it, while all the optimal policies $(\pi_i^*)_{i=1,2,3}$ are compatible
with the central frontier zone $F_4(n)$. If the exploration terminates in a frontier zone, then one basically does not have enough statistical
evidence to favor a particular optimal policy and the tiling algorithm simply selects one of the optimal
policies compatible with the frontier zone. Hence, the purpose of frontier zones is to guarantee that the
exploration phase will stop even for parameter values for which discriminating between several neighboring
optimal policies is challenging. Of course, in practice, there may be other considerations that suggest
selecting one compatible policy rather than another, but the general regret bound below simply assumes that
any compatible policy is selected at the termination of the exploration phase.
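The stopping rule can be illustrated on a toy problem. The sketch below is not the channel model: it assumes a hypothetical one-dimensional parameter (the mean of 0/1 observations collected under the exploration policy), two policy zones $Z_1 = [0, 0.5]$ and $Z_2 = (0.5, 1]$, a single frontier zone of half-width $\epsilon(n)$ around 0.5, and a Hoeffding-type confidence interval. All names are illustrative.

```python
import math


def tiling_stop(samples, n, threshold=0.5):
    """Return (T_n, zone) for a toy one-dimensional tiling, or (None, None).

    samples: iterable of 0/1 observations gathered under the exploration policy.
    n: time horizon, which sets both the confidence radius and the frontier width.
    """
    # Frontier half-width of order (log n / n)^(1/3), an assumed tuning.
    eps = (math.log(n) / n) ** (1.0 / 3.0)
    total = 0
    for t, x in enumerate(samples, start=1):
        total += x
        centre = total / t
        radius = math.sqrt(math.log(n) / (6.0 * t))
        lo, hi = centre - radius, centre + radius
        if hi <= threshold:
            return t, "Z1"   # confidence interval inside the first policy zone
        if lo > threshold:
            return t, "Z2"   # confidence interval inside the second policy zone
        if threshold - eps <= lo and hi <= threshold + eps:
            return t, "F"    # confidence interval inside the frontier zone
    return None, None
```

When the true parameter is far from 0.5, the interval leaves the frontier quickly and exploration stops in a policy zone; when it is close to 0.5, exploration stops in the frontier once the interval is narrower than the frontier itself, and either compatible policy may then be selected.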

3.3 Performance Analysis 

To evaluate the performance of this algorithm, we will consider the regret, for the prescribed time horizon
$n$, defined as the difference between the expected cumulated reward obtained under the optimal policy and
the one obtained following the algorithm,

$$ R_n(\theta^*) = E_{\theta^*}^{\pi^*_{\theta^*}} \left[ \sum_{t=1}^{n} r(X_t, A_t) \right] - E_{\theta^*}^{TA} \left[ \sum_{t=1}^{n} r(X_t, A_t) \right], \tag{5} $$

where $\theta^*$ denotes the unknown parameter value. To obtain bounds for $R_n(\theta^*)$ that do not depend on $\theta^*$, we
will need the following assumptions.

Assumption 1. The confidence region $\Delta_t$ is constructed so that there exist constants $c_1, c_1', n_{\min} \in \mathbb{R}_+$ such
that, for all $\theta \in \Theta$, for all $n \ge n_{\min}$, for all $t \le n$, $P_\theta\left(\theta \in \Delta_t,\ \delta(\Delta_t) \le c_1 \sqrt{\log n / t}\right) \ge 1 - c_1' \exp\{-\tfrac{1}{3} \log n\}$,
where $\delta(\Delta_t) = \sup\{\|\theta - \theta'\|_\infty,\ \theta, \theta' \in \Delta_t\}$ is the diameter of the confidence region.

Assumption 2. Given a size $\epsilon(n)$, one may construct the frontier zones $(F_j(n))_j$ such that there exist
constants $c_2, c_2' \in \mathbb{R}_+$ for which

• $\delta(\Delta_t) \le c_2 \epsilon(n)$ implies that there exists either $i$ such that $\Delta_t \subset Z_i$ or $j$ such that $\Delta_t \subset F_j(n)$,

• if $\theta \in F_j(n)$, there exists $\theta' \in Z_i$ such that $\|\theta - \theta'\| \le c_2' \epsilon(n)$, for all policy zones $Z_i$ compatible with
$F_j(n)$ (i.e., such that $Z_i \cap F_j(n) \neq \emptyset$).

Assumption 3. For all $i$, there exists $d_i \in \mathbb{R}_+$ such that for all $\theta, \theta' \in \Theta$, $|V_\theta^{\pi_i^*} - V_{\theta'}^{\pi_i^*}| \le d_i \|\theta - \theta'\|_\infty$.

Assumption 1 pertains to the construction of the confidence region and may usually be met by standard
applications of the Hoeffding inequality. The constant 1/3 is meant to match the worst-case rate given in
Theorem 1 below. Assumption 2 formalizes the idea that the frontier zones should allow any confidence
region of diameter less than $\epsilon(n)$ to be fully included either in an original policy zone or in a frontier zone,
while at the same time ensuring that, locally, the size of the frontier is of order $\epsilon(n)$. The applicability of the
tiling algorithm crucially depends on the construction of these frontiers. Finally, Assumption 3 is a standard
regularity condition (Lipschitz continuity) which is usually met in most applications. The performance of
the tiling approach is given by the following theorem, which is proved in Appendix A.

Theorem 1. Under Assumptions 1, 2 and 3, and for all $n \ge n_{\min}$, the duration of the exploration phase is
bounded, in expectation, by

$$ E_{\theta^*}(T_n) \le c \, \frac{\log n}{\epsilon^2(n)}, \tag{6} $$

and the regret by

$$ R_n(\theta^*) \le E_{\theta^*}(T_n) + c' n \epsilon(n) + c'' n \exp\{-\tfrac{1}{3} \log n\}, \tag{7} $$

where $c = (c_1/c_2)^2$, $c' = c_2' \max_{i,j}(d_i + d_j)$ and $c'' = c_1'$. The minimal worst-case regret is obtained when
selecting $\epsilon(n)$ of the order of $(\log n / n)^{1/3}$, which yields the bound $R_n(\theta^*) \le C (\log n)^{1/3} n^{2/3}$ for some constant
$C$.
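As a rough sketch of where this rate comes from, one may substitute (6) into (7) and minimize the sum of the two terms that depend on $\epsilon(n)$, treating $c$ and $c'$ as the constants of Theorem 1. This back-of-the-envelope balance is only meant as an illustration and does not replace the proof in the appendix:

```latex
\begin{align*}
g(\epsilon) &= \frac{c \log n}{\epsilon^{2}} + c' n \epsilon, \\
g'(\epsilon) = 0 &\iff \epsilon^{3} = \frac{2c}{c'} \, \frac{\log n}{n}
  \iff \epsilon(n) = \Big(\frac{2c}{c'}\Big)^{1/3} \Big(\frac{\log n}{n}\Big)^{1/3}, \\
g(\epsilon(n)) &= \Big[\, c \Big(\frac{c'}{2c}\Big)^{2/3} + c' \Big(\frac{2c}{c'}\Big)^{1/3} \Big] (\log n)^{1/3} n^{2/3}.
\end{align*}
```

The remaining term satisfies $c'' n \exp\{-\tfrac{1}{3} \log n\} = c'' n^{2/3}$, which is of no larger order than $(\log n)^{1/3} n^{2/3}$ as soon as $\log n \ge 1$.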

The duration bound in (6) follows from the observation that exploration is guaranteed to terminate only
when the confidence region defined by Assumption 1 reaches a size which is of the order of the diameter of
the frontier, that is, $\epsilon(n)$. The second term in the right-hand side of (7) corresponds to the maximal regret if
the exploration terminates in a frontier zone. The rate $(\log n)^{1/3} n^{2/3}$ is obtained when balancing these two
terms ($E_{\theta^*}(T_n)$ and $c' n \epsilon(n)$). A closer examination of the proof in Appendix A shows that if one can ensure
that the exploration indeed terminates in one of the policy regions $Z_i$, then the regret may be bounded by
an expression similar to (7) but without the $c' n \epsilon(n)$ term. In this case, by using a constant strictly larger
than 1 (instead of 1/3) in Assumption 1, one can obtain logarithmic regret bounds. To do so, however, one
needs to introduce additional constraints to guarantee that exploration terminates in a policy region rather
than in a frontier. These constraints typically take the form of an assumed sufficient margin between the
actual parameter value $\theta^*$ and the borders of the associated policy zone. This is formalized in Theorem 2,
which is proved in Appendix B. First, we introduce an alternative to Assumption 1.

Assumption 4. The confidence region A t is constructed so that there exists constants ci, c[,n m i n £ R + , 
x > 1 such that, for all 9 £ 0, for all n > n min , for all t < n, V e (& £ A t , 5{A t ) < ci^J > l—d x exp{-2x} . 

Theorem 2. Consider 9* in a policy zone Z such that there exists k for which min^^ \\9* — > k. 
Under Assumption 4, the regret is bounded by $R_n(\theta^*) \le C(k) \log(n) + C'(k)$ for all $n \ge n_{\min}$ and for some
constants $C(k)$ and $C'(k)$ which decrease with n.



4 Application to Channel Access 

In the following, we consider two specific instances of the opportunistic channel access model introduced in
Section 2. First, we study the single channel case, which is an interesting illustration of the tiling algorithm.
Indeed, in this model, there are many different policy zones, and both the optimal policy and the long-term
reward can be explicitly computed in each of them. In addition, the one channel model plays a crucial role
in determining the Whittle index policy. Next, we apply the tiling algorithm to an N-channel model with
stochastically identical channels.

4.1 One Channel Model 

Consider a single channel with bandwidth $B = 1$. At each time, the secondary user can choose to sense the
channel, hoping to receive a reward equal to 1 if the channel is idle and taking the risk of receiving no reward
if the channel is occupied. The user can also decide not to observe the channel and then receives a reward equal
to $\lambda$, with $0 \le \lambda \le 1$.

4.1.1 Optimal policies, long-term rewards and policy zones 

Studying the form of the optimal policy as a function of $\theta = (\alpha, \beta)$ brings to light several optimal policy
zones. In each zone, the optimal policy is different and is characterized by the pair $(k_0, k_1)$, which defines
how long the secondary user needs to wait (i.e. not observe the channel) before observing the channel again,
depending on the outcome of the last observation. Denote by $\pi_{(k_0,k_1)}$ the policy which consists in waiting
$k_0 - 1$ (resp. $k_1 - 1$) time slots before observing the channel again if, last time the channel was sensed, it was
occupied (resp. idle), and by $Z_{(k_0,k_1)}$ the corresponding policy zone. Let $\pi_\infty$ be the policy which consists
in never observing the channel; this policy is optimal when $\alpha$ and $\beta$ are such that the probability that the
channel is idle is always lower than $\lambda$. The policy zones are represented in Figure 4.



Figure 4: The optimal policy regions in the one channel model with $\lambda = 0.3$ (horizontal axis: $\alpha$).



The long-term reward of each policy can be exactly computed:

4.1.2 Applying the tiling algorithm

Applying the tiling algorithm to this model is not straightforward, as there are infinitely many policy zones.
We introduce border zones between $Z_{(2,1)}$, $Z_{(1,2)}$, $Z_\infty$ as shown in Figure 4. Moreover, to address
the problem of the infinite number of zones, we propose to aggregate the policy zones when $\alpha < \lambda$ and $\beta > \lambda$. For
example, we aggregate all the zones $Z_{(k_0,1)}$ with $2 < k_0 < l$, and the non-observation zone $Z_\infty$ with the zones
$Z_{(k_0,1)}$ such that $k_0 > l'$, where $l' < l$ are variables to be tuned according to the time horizon $n$. Thus,
Theorem 1 still applies.

Recall that the tiling algorithm consists in learning the parameter $(\alpha, \beta)$ until the estimated confidence
region fully enters either one of the policy zones or one of the frontier zones. The exploration policy, denoted
by $\pi_0$ in Section 3, consists in always sensing the channel. At time $t$, the estimated parameter is given by

$$ \hat\alpha_t = \frac{N_t^{0,1}}{N_t^{0}} \quad \text{and} \quad \hat\beta_t = \frac{N_t^{1,1}}{N_t^{1}}, \tag{8} $$

where $N_t^0$ (resp. $N_t^1$) is the number of visits to 0 (resp. 1) until time $t$ and $N_t^{0,1}$ (resp. $N_t^{1,1}$) is the number
of visits to 0 (resp. 1) followed by a visit to 1 until time $t$.

In order to verify that this model satisfies the conditions of Theorem 1, we need to make an irreducibility
assumption on the Markov chain.

Assumption 5. There exists $\eta > 0$ such that $(\alpha, \beta) \in \Theta = [\eta, 1 - \eta]^2$.

This condition ensures that, during the time horizon $n$, the Markov chain visits the two states sufficiently
often to estimate the parameter $(\alpha, \beta)$. We define the confidence region as the rectangle

$$ \Delta_t = \left[ \hat\alpha_t \pm \sqrt{\frac{\log n}{6 N_t^0}} \right] \times \left[ \hat\beta_t \pm \sqrt{\frac{\log n}{6 N_t^1}} \right]. \tag{9} $$
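The estimates (8) and the half-widths of the rectangle (9) can be computed from the sequence of sensed states as in the following sketch. The function name and return format are illustrative. One assumption is made explicit in the code: a visit is counted only when its successor has also been observed, so that each ratio in (8) is an empirical transition frequency.

```python
import math


def estimate_and_radius(states, n):
    """Estimate (alpha, beta) from 0/1 states sensed at every slot.

    Returns (alpha_hat, beta_hat, radius_alpha, radius_beta), or None when one
    of the two states has not been visited yet. n is the time horizon.
    """
    visits = [0, 0]    # N_t^0 and N_t^1 (visits with an observed successor: assumption)
    to_idle = [0, 0]   # N_t^{0,1} and N_t^{1,1}
    for prev, nxt in zip(states, states[1:]):
        visits[prev] += 1
        to_idle[prev] += nxt
    if visits[0] == 0 or visits[1] == 0:
        return None
    alpha_hat = to_idle[0] / visits[0]
    beta_hat = to_idle[1] / visits[1]
    # Half-widths of the confidence rectangle, sqrt(log n / (6 N)).
    radius_alpha = math.sqrt(math.log(n) / (6.0 * visits[0]))
    radius_beta = math.sqrt(math.log(n) / (6.0 * visits[1]))
    return alpha_hat, beta_hat, radius_alpha, radius_beta
```

For the sequence `[0, 1, 1, 0, 0, 1]`, state 0 is left three times (twice towards 1) and state 1 twice (once towards 1), so the estimates are 2/3 and 1/2; the rectangle is wider in the $\beta$ direction because state 1 was visited less often.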



To prove that the regret of the tiling algorithm in a single channel model is bounded, we need to verify
the three assumptions of Theorem 1. First, it is shown in Appendix C that Assumption 1 holds. Secondly,
except when $\alpha < \lambda$ and $\beta > \lambda$, Assumption 2 is obviously satisfied, since the confidence region and the policy
and frontier zones are all rectangles (see Fig. 4). Let $\epsilon(n)$ be half of the smallest width of the frontier zones.
Additionally, when $\alpha < \lambda$ and $\beta > \lambda$, if the center frontier zone is large enough, the aggregation of the zones
can be done such that the second condition holds. Finally, for every optimal policy, the long-term reward is a
Lipschitz continuous function of $(\alpha, \beta)$ for $\alpha, \beta \in [\eta, 1 - \eta]$, so the third condition is also satisfied.

4.1.3 Experimental results 

As suggested by Theorems 1 and 2, the length of the exploration phase following the tiling algorithm depends
on the value of the true parameter $(\alpha^*, \beta^*)$. In addition, for a fixed value of $(\alpha^*, \beta^*)$, the length of the
exploration varies from one run to another, depending on the size of the confidence region. To illustrate
these effects, we consider two different values of the parameters: $(\alpha^*, \beta^*) = (0.8, 0.05)$, which is included in
the policy zone $Z_{(1,2)}$ and far from any frontier zone, and $(\alpha^*, \beta^*) = (0.8, 0.2)$, which lies in the frontier
zone between $Z_{(1,1)}$ and $Z_{(1,2)}$ and is close to the border of the frontier zone. The corresponding empirical
distributions of the length of the exploration phase are represented in Figure 5. Note that the shapes of
these two distributions are quite different and that the empirical mean of the length of the exploration phase
is lower for a parameter which is far from any frontier zone than for a parameter which is close to the border
of a frontier zone.

In Figure 6, we compare the cumulated regrets $R_n^{TA}$ of the tiling algorithm to the regrets $R_n^{DL}(l_{expl})$ of
an algorithm with a deterministic length of exploration phase $l_{expl}$. Both algorithms are run with $(\alpha^*, \beta^*) =
(0.8, 0.05)$. We use two values of $l_{expl}$: one lower ($l_{expl} = 20$) and the other larger ($l_{expl} = 300$) than the
average length of the exploration phase following the tiling algorithm, which ranges between 40 and 150
for this value of the parameter (see Fig. 5). The algorithms are run four times independently and all the
cumulated regrets are represented in Figure 6. Note that, $(\alpha^*, \beta^*)$ being in the interior of a policy zone
(i.e. not in a frontier zone), the regret of the tiling algorithm is null during the exploitation phase, since
the optimal policy for the true parameter is used. Similarly, when the deterministic length $l_{expl}$ of the
exploration phase is sufficiently large, the estimation of the parameter is quite precise, and therefore the regret
during the exploitation phase is null. On the other hand, too large a value of $l_{expl}$ increases the regret during
the exploration phase: we observe in Figure 6 that the regret $R_n^{DL}(l_{expl})$ with $l_{expl} = 300$ is larger than $R_n^{TA}$.
When the deterministic length of the exploration phase is smaller than the average length of the exploration
phase following the tiling algorithm, either the parameter is estimated precisely enough, and then $R_n^{DL}(l_{expl})$
is smaller than $R_n^{TA}$, or the estimated value is too far away from the actual value and the policy followed
during the exploitation phase is not the optimal one. In the latter case, the regret is not null during the
exploitation phase and $R_n^{DL}(l_{expl})$ is noticeably large. This can be observed in Figure 6: in three of the four
runs, the cumulated regret $R_n^{DL}(l_{expl})$ with $l_{expl} = 20$ (dashed line) is small, whereas in the remaining run
it increases sharply and steadily.

Figure 5: Distribution of the length of the exploration phase following the tiling algorithm for $(\alpha^*, \beta^*) =
(0.8, 0.05)$ and for $(\alpha^*, \beta^*) = (0.8, 0.2)$.

Figure 6: Comparison of the cumulated regret of the tiling algorithm (shaped markers) and an algorithm
with a deterministic length of exploration phase equal to 20 (dashed line) or equal to 300 (solid line) for
$(\alpha^*, \beta^*) = (0.8, 0.05)$.

4.2 Stochastically Identical Channels Case 

In this section, consider a full channel allocation model where all the $N$ channels have equal bandwidth
$B = 1$ and are stochastically identical in terms of primary usage, i.e. all the channels have the same
transition probabilities: for all $i \in \{1, \dots, N\}$, $\alpha(i) = \alpha$ and $\beta(i) = \beta$. In addition, let $\lambda = 0$.

4.2.1 Optimal policies, long-term rewards and policy zones 

Under these assumptions, the near-optimal Whittle's index policy has been shown to be equivalent to the
myopic policy (see [5]), which consists in selecting the channels to be sensed according to the expected
one-step reward: $A_t = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{N} a(i)\, p^{k(i),y(i)}_{\alpha,\beta}$, given that channel $i$ has not been observed for $k(i)$ time
slots and the last observation was $y(i)$. Recall that $\mathcal{A}$ denotes the set of $N$-dimensional vectors with $M$
components equal to 1 and $N - M$ equal to 0. Following this policy, the secondary user senses the $M$
channels that have the highest probabilities $p^{k(i),y(i)}_{\alpha,\beta}$ to be free.

The resulting policy depends only on whether the system is positively correlated (a < f3) or negatively 
correlated (/? < a) (see [5] for details). To explain an important difference between the positively and 
negatively correlated cases, we represent in Figure [7] the probability p^g that the j-th channel is idle 
for y(j) = 1 and y(j) = as a function of k(j), in the two cases. We observe that, for all k(j) > 1, for all 

y(j)e{o,i}, 

< pZ = a ^p k S vU) ^p=p 1 ^ if «^ (10) 

/«j = 0<P k $ V<d) <<* = P«j if/3<«- 
Then, in the positively correlated case, according to equation (10), if a channel i has just been observed 
to be idle, i.e. k(i) = 1, y(i) = 1, the optimal action is to observe it once more since the channel has the 
highest (or equal) probability to be free: for all $j \ne i$, $p_{\alpha,\beta}^{1,1} \ge p_{\alpha,\beta}^{k(j),y(j)}$. On the contrary, if a channel has 
just been observed to be occupied, i.e. k(i) = 1, y(i) = 0, it is optimal not to observe it since the channel 
has the lowest probability to be free. When the system is negatively correlated, the policy is reversed. 
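
The ordering used in this argument can be checked numerically. The sketch below assumes the standard two-state Markov chain closed form for the belief (with α the busy-to-idle probability and β the idle-to-idle probability, an interpretation not restated here) and verifies that every belief stays between α and β. The helper names are hypothetical.

```python
def idle_probability(alpha: float, beta: float, k: int, y: int) -> float:
    """Belief that a channel is idle, k >= 1 slots after observing state y.

    Assumption: alpha = P(idle next | busy now), beta = P(idle next | idle now).
    """
    nu1 = alpha / (1.0 - beta + alpha)  # stationary probability of the idle state
    return nu1 + (beta - alpha) ** k * (y - nu1)


def satisfies_bounds(alpha: float, beta: float,
                     k_max: int = 50, tol: float = 1e-12) -> bool:
    """Check min(alpha, beta) <= belief <= max(alpha, beta) for k = 1..k_max, y in {0, 1}."""
    lo, hi = min(alpha, beta), max(alpha, beta)
    return all(
        lo - tol <= idle_probability(alpha, beta, k, y) <= hi + tol
        for k in range(1, k_max + 1)
        for y in (0, 1)
    )
```

The extreme values are reached at k = 1, which is why a channel observed one slot ago is either the best or the worst candidate depending on the sign of the correlation.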

Figure 7: Probabilities $p_{\alpha,\beta}^{k(j),y(j)}$ that the j-th channel is idle for y(j) = 1 (solid line) and y(j) = 0 (dashed 
line) as a function of k(j), in the positively (top, $\alpha < \beta$) and the negatively (bottom) correlated cases. 

Let $\pi_+$ be the policy in the positively correlated case and $\pi_-$ the policy in the negatively correlated 
one. The long-term rewards of policies $\pi_+$ and $\pi_-$ cannot be computed exactly. However, one may use the 
approach of [16] to compute approximations of $V_{\alpha,\beta}^{\pi_+}$ and $V_{\alpha,\beta}^{\pi_-}$ and obtain: 



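[Equation (11): closed-form approximations of $V_{\alpha,\beta}^{\pi_+}$ and $V_{\alpha,\beta}^{\pi_-}$; the expressions are not recoverable from the extracted text.]    (11) 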



4.2.2 Applying the tiling algorithm 

The secondary user thus needs to distinguish between values of the parameter that lead to positive or negative 
one-step correlations in the chain. Knowing which of these two alternatives applies is sufficient to determine 
the optimal policy. Let $Z_+$ and $Z_-$ be the policy zones corresponding to the two optimal policies $\pi_+$ and 
$\pi_-$ (see Figure [8]). Between these zones, we introduce a frontier zone $F(n) = \{(\alpha, \beta) : |\alpha - \beta| < \epsilon(n)\}$. 

Figure 8: Policy zones and frontier for the N stochastically identical channels model. 

The estimation of the parameter $(\alpha, \beta)$ and the confidence region are similar to the one channel case 
(see Section [4.1]). Assumption [1] of Theorem [1] is thus satisfied. Moreover, given the simple geometry 
of the frontier zone, Assumption [2] is easily verified. Indeed, any confidence rectangle whose length is less 
than $\epsilon(n)/2$ is either included in the frontier zone or in one of the policy zones. Moreover, for any point in 
the frontier zone, there exists a point which is at a distance less than $\epsilon(n)$ and is also in the frontier zone 
but belongs to the other policy zone. Finally, the approximations of the long-term rewards $V_{\alpha,\beta}^{\pi_+}$ and $V_{\alpha,\beta}^{\pi_-}$ 
defined in (11) are Lipschitz functions, and hence the third condition of Theorem [1] is satisfied. 
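
The geometric test behind this argument can be sketched as follows. The code assumes an axis-aligned confidence rectangle for (α, β), takes the policy zones to be the half-planes α < β and β < α, and uses hypothetical names and return labels; it is an illustration of the inclusion checks, not the paper's implementation.

```python
def classify_rectangle(a_lo: float, a_hi: float,
                       b_lo: float, b_hi: float, eps: float) -> str:
    """Locate the rectangle [a_lo, a_hi] x [b_lo, b_hi] of (alpha, beta) values.

    Returns "Z+" if alpha < beta everywhere in the rectangle, "Z-" if
    beta < alpha everywhere, "frontier" if |alpha - beta| < eps everywhere,
    and "continue" if none of these inclusions holds yet.
    """
    if a_hi < b_lo:
        return "Z+"
    if b_hi < a_lo:
        return "Z-"
    # Over the rectangle, alpha - beta ranges over [a_lo - b_hi, a_hi - b_lo].
    if max(a_hi - b_lo, b_hi - a_lo) < eps:
        return "frontier"
    return "continue"
```

Under this reading, exploration would go on while the function returns "continue" and stop as soon as the rectangle fits inside a policy zone or inside the frontier zone.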

4.2.3 Experimental Results 

To illustrate the performance of the approach, we ran the tiling algorithm for a grid of values of $(\alpha^*, \beta^*)$ 
regularly covering the set $[\eta, 1 - \eta]^2$, with $\eta = 0.01$. For each value of the parameter, 10 Monte Carlo 
replications of the data were processed. The time horizon is n = 10,000 and the width $\epsilon(n)$ of the frontier 
zone is taken equal to 0.15. The resulting cumulated regret has an empirical distribution which does not vary 
much with the actual value of the parameter and is, on average, smaller than 90. However, it may be observed 
that the average length of the exploration phase $T_n$, represented in Figure [9], depends on the value of $(\alpha^*, \beta^*)$. 
First observe that $T_n$ is quite large for $(\alpha^*, \beta^*)$ close to the frontier zone and small otherwise. Indeed, when 
the actual parameter is far from the policy frontier, the exploration phase runs until the confidence region 
is included in the corresponding policy zone, which is achieved very rapidly. On the contrary, when the true 
parameter is inside the frontier zone, the exploration phase lasts longer. Note that for parameter values 
that sit exactly on the policy frontier, both policies are indeed equivalent. This observation is captured, to 
some extent, by the algorithm, as the maximal durations of the exploration phase do not occur exactly on the 
policy frontier. The second important observation is that the exploration phase is the longest when $(\alpha^*, \beta^*)$ 
is close to (0,0) or (1,1). Indeed, when $(\alpha^*, \beta^*)$ is around (0,0) (resp. (1,1)), the channel is very often 
busy (resp. idle) and hence it is difficult to estimate $\beta$ (resp. $\alpha$). 
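
A rough way to quantify this effect is to count how often each state is visited, since β can only be learned from slots that start in the idle state and α from slots that start in the busy state. The sketch below assumes a stationary chain with α the busy-to-idle probability and β the idle-to-idle probability; the function names are hypothetical.

```python
from typing import Tuple


def stationary_idle_probability(alpha: float, beta: float) -> float:
    """Stationary probability of the idle state (assumes 0 < alpha, beta < 1)."""
    return alpha / (1.0 - beta + alpha)


def expected_visits(alpha: float, beta: float, t: int) -> Tuple[float, float]:
    """Expected number of slots spent (busy, idle) over t slots at stationarity."""
    nu1 = stationary_idle_probability(alpha, beta)
    return (1.0 - nu1) * t, nu1 * t
```

With α = β = 0.01 and t = 10,000, only about 100 slots are expected to start in the idle state, so few samples are available for β; the situation is symmetric for α when both parameters are close to 1.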




Figure 9: Length of the exploration phase for the tiling algorithm for different values of $(\alpha^*, \beta^*)$. 

The latter effect is partially predicted by the asymptotic approach of [5], who used the Central Limit 
Theorem to show that the length of the exploration phase, for a channel with transition probabilities $(\alpha^*, \beta^*)$, 
has to be equal to 

$l_{\mathrm{expl}}(\alpha^*, \beta^*, \delta, P_C) = {}$ [right-hand side not recoverable from the extracted text]    (12) 

in order to guarantee that $\alpha$ is properly estimated (with a similar result holding for $\beta$). In (12), $\Phi$ stands for the 
standard normal cumulative distribution function and $\delta$ and $P_C$ are values such that $P_C = P(|\hat{\alpha} - \alpha^*| < \delta \alpha^*)$. 
This formula rightly suggests that when $\alpha^*$ is very small, there are very few observed transitions from the 
busy to the idle state and hence that estimating $\alpha$ is a difficult task. However, it can be seen in Figure [9] 
that with the tiling algorithm, the length of the exploration phase is indeed longer when both $\alpha$ and $\beta$ are 
very small but is not particularly long when $\alpha$ is small and $\beta$ is close to one (upper left corner in Figure [9]). 
Indeed, in the latter case, the channel state is very persistent, which implies few observed transitions and, 
consequently, that estimating either $\alpha$ or $\beta$ would require many observations. On the other hand, in 
this case the channel is strongly positively correlated and even a few observations suffice to decide that the 
appropriate policy is $\pi_+$ rather than $\pi_-$. 

5 Conclusion 

The tiling algorithm is a model-based reinforcement learning algorithm applicable to opportunistic channel 
access. This algorithm is meant to adequately balance exploration and exploitation by adaptively monitoring 
the duration of the exploration phase so as to guarantee a $(\log n)^{1/3} n^{2/3}$ worst-case regret bound for a 
pre-specified finite horizon n. Furthermore, it has been shown in Theorem [2] that in large regions of the parameter 
space, the regret can indeed be guaranteed to be logarithmic. In numerical experiments on the single channel 
and stochastically identical channels models, it has been observed that the tiling algorithm is indeed able 
to adapt the length of the exploration phase, depending on the sequence of observations. Furthermore, we 
observed in the stochastically identical model that the algorithm was able to interrupt the exploration phase 
rapidly in cases where the nature of the optimal policy is rather obvious. 

For the future, the tiling algorithm also shows high potential for other applications, for example 
in wireless communications. Concerning opportunistic channel access, the algorithm as it stands is not 
able to handle the general N channel model presented in Section [2] (with stochastically non-identical channels). 
An interesting direction for future work would therefore be to adapt our approach so that its main principles 
apply to the general model. 

A Appendix: Proof of Theorem [1] 

The confidence zone is such that, at the end of the exploration phase, 
$P_{\theta^*}\big(\theta^* \in \Delta_t,\ \delta(\Delta_t) \le c_1 \sqrt{\log n}/\sqrt{t}\big) \ge 1 - c_1' \exp\{-\tfrac{1}{3} \log n\}$. At the end of the exploration phase, if the true parameter $\theta^*$ is in the confidence 
region, there are two possibilities: either the confidence zone $\Delta_t$ is included in a policy zone $Z_i$ or it is 
included in a frontier zone $F_j(n)$. If the confidence zone is in a policy zone, the regret is equal to the 
sum of the duration of the exploration phase and of the loss corresponding to the case where the confidence 
region is violated: $R_n(\theta^*) = E_{\theta^*}(T_n) + c_1' n \exp\{-\tfrac{1}{3} \log n\}$. If the confidence zone is in a frontier zone 
$F_j(n)$, an additional term of the regret is the loss due to the fact that the policy selected at the end of the 
exploration phase is not necessarily the optimal one for the true parameter $\theta^*$. Let $\pi^*$ denote the optimal 
policy for $\theta^*$ and $\pi_k$ the selected policy. Note that $Z_i$ and $Z_k$ are compatible with $F_j(n)$. The loss is 
$V_{\theta^*}^{\pi^*} - V_{\theta^*}^{\pi_k} = (V_{\theta^*}^{\pi^*} - V_{\theta}^{\pi^*}) + (V_{\theta}^{\pi_k} - V_{\theta^*}^{\pi_k}) + (V_{\theta}^{\pi^*} - V_{\theta}^{\pi_k})$, where $\theta \in Z_k \cap F_j(n)$. The last term is non-positive 
since $\pi_k$ is the optimal policy for $\theta$. The two other terms can be bounded using Assumption [3]. Then, 
$|V_{\theta^*}^{\pi^*} - V_{\theta^*}^{\pi_k}| \le (d_i + d_k) \|\theta^* - \theta\|$. According to Assumption [2], one can choose $\theta$ such that $\|\theta^* - \theta\| \le c_2' \epsilon(n)$, 
for which $R_n(\theta^*) \le E_{\theta^*}(T_n) + n c' \epsilon(n) + c_1' n \exp\{-\tfrac{1}{3} \log n\}$, where $c' = c_2' \max_{i,k}(d_i + d_k)$. 

The maximal regret is obtained when the confidence region belongs to a frontier zone. According to 
Assumptions [1] and [2], if t satisfies $c_1 (\log n / t)^{1/2} < c_2 \epsilon(n)$ then $t \ge T_n$, with large probability. Therefore, 
$E_{\theta^*}(T_n) \le (c_1^2 \log n)/(c_2 \epsilon(n))^2$. The regret is then bounded by 

$$\max_{\theta^*} R_n(\theta^*) \le \frac{c_1^2 \log n}{c_2^2\, \epsilon^2(n)} + n c' \epsilon(n) + c_1' n \exp\{-\tfrac{1}{3} \log n\},$$ 

which is minimized for $\epsilon(n) = \left( \frac{2 c_1^2 \log n}{c'\, c_2^2\, n} \right)^{1/3}$. 
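
As a worked check of this last step, the minimization can be carried out explicitly on the two terms of the bound that depend on $\epsilon(n)$. The shorthand $A$ and $B$ below is introduced only for this sketch and does not appear in the original; the term that does not depend on $\epsilon(n)$ plays no role in the minimizer.

```latex
% Shorthand (not in the original): A = c_1^2 \log n / c_2^2, \quad B = c'.
\begin{align*}
g(\epsilon) &= \frac{A}{\epsilon^{2}} + n B \epsilon, \\
g'(\epsilon) &= -\frac{2A}{\epsilon^{3}} + n B = 0
  \;\Longrightarrow\; \epsilon^{3} = \frac{2A}{nB}, \\
g\!\left(\Big(\tfrac{2A}{nB}\Big)^{1/3}\right)
  &= \big(2^{-2/3} + 2^{1/3}\big)\, A^{1/3} (nB)^{2/3}
   = 3 \cdot 2^{-2/3} \Big(\frac{c_1^{2}\, c'^{2}}{c_2^{2}}\Big)^{1/3} (\log n)^{1/3}\, n^{2/3}.
\end{align*}
```

The last line recovers the $(\log n)^{1/3} n^{2/3}$ order of the worst-case regret.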



B Appendix: Proof of Theorem [2] 

The condition $\min_{\theta \notin Z} |\theta^* - \theta| > \kappa$ means that the distance between $\theta^*$ and any border of the policy zone 
Z is larger than $\kappa$. Hence, as soon as $\delta(\Delta_t) < \kappa$, the confidence region $\Delta_t$ is included in the policy zone 
Z. The regret of the tiling algorithm is then equal to $R_n(\theta^*) = E_{\theta^*}(T_n) + c_1' n \exp\{-2x\}$. According to 
Assumption [1], if t satisfies $c_1 (x/t)^{1/2} < \kappa$ then $t \ge T_n$ with large probability. Therefore, $E_{\theta^*}(T_n) \le c_1^2 x / \kappa^2$ 
and the regret is bounded by $R_n(\theta^*) \le \frac{c_1^2 x}{\kappa^2} + c_1' n \exp\{-2x\}$, which is minimized for $x = \log(2 c_1' \kappa^2 n / c_1^2)/2$. For 
this value of x, we have $R_n(\theta^*) \le \frac{c_1^2}{2\kappa^2} \big( \log(n) + \log(2 c_1' \kappa^2 / c_1^2) + 1 \big)$. 
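
The closed form at the end of this proof can be checked numerically. The sketch below evaluates the bound $c_1^2 x/\kappa^2 + c_1' n \exp\{-2x\}$ at the minimizing $x$ and compares it with the logarithmic expression; the function names and the numerical constants used in the check are arbitrary illustrations, not values from the paper.

```python
import math


def regret_bound(x: float, n: int, kappa: float, c1: float, c1_prime: float) -> float:
    """Bound c1^2 * x / kappa^2 + c1' * n * exp(-2x) as a function of x."""
    return c1 ** 2 * x / kappa ** 2 + c1_prime * n * math.exp(-2.0 * x)


def optimal_x(n: int, kappa: float, c1: float, c1_prime: float) -> float:
    """Minimizer of regret_bound in x, obtained by setting the derivative to zero."""
    return 0.5 * math.log(2.0 * c1_prime * kappa ** 2 * n / c1 ** 2)


def closed_form(n: int, kappa: float, c1: float, c1_prime: float) -> float:
    """Value of the bound at the minimizer: logarithmic in n."""
    return c1 ** 2 / (2.0 * kappa ** 2) * (
        math.log(n) + math.log(2.0 * c1_prime * kappa ** 2 / c1 ** 2) + 1.0
    )
```

Since the bound is convex in x, evaluating it slightly to the left or right of `optimal_x` gives a larger value, which is a quick sanity check of the minimizer.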
C Appendix: Confidence interval for Markov Chains 



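In this appendix, we prove that the confidence region $\Delta_t$ defined in equation §§§ satisfies Assumption [1]. 
First, remark that the event $\{\delta(\Delta_t) \le c_1 \sqrt{\log n / t}\} = \{N_t^0 \ge ct,\ N_t^1 \ge ct\}$ for a constant $c$ whose expression 
is lost in the extracted text. Hence, using the Hoeffding inequality, we have 
$P_{(\alpha,\beta)}\big((\alpha,\beta) \notin \Delta_t,\ \delta(\Delta_t) \le c_1 \sqrt{\log n / t}\big) \le 4 \exp\{-\tfrac{1}{3} \log n\}$. Moreover, we 
need to bound the probability $P\big(\delta(\Delta_t) > c_1 \sqrt{\log n / t}\big)$. We apply Theorem 2 of [3] to bound $P(N_t^1 < ct)$. To 
do so, remark that $\inf_{\alpha,\beta} \nu_1 = \eta$ and that the minoration constant $1 - |\beta - \alpha|$ is lower-bounded by $2\eta$. We 
then have 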

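$$P(N_t^1 < ct) \le P\big(N_t^1 - \nu_1 t \le -(1 - c/2)\, \nu_1 t\big) \le \dots \le \exp\{-\tfrac{1}{3} \log(n)\},$$ 

[the intermediate exponential bound obtained from Theorem 2 of [3] is not legible in the extracted text] 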

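where the last inequality holds for $t > t_n := \big( \tfrac{8}{3} \log(n)\, \eta^{-4} (2 - c)^{-2} \big)^{1/3}$. Similarly, we can show that, for 
$t > t_n$, $P(N_t^0 < ct) \le \exp\{-\tfrac{1}{3} \log(n)\}$. Hence, for all $t > t_n$, $P\big(\delta(\Delta_t) > c_1 \sqrt{\log n / t}\big) \le 2 \exp\{-\tfrac{1}{3} \log(n)\}$. In 
addition, for all $t \le t_n$, $c_1 \sqrt{\log n / t} \ge c_1 \sqrt{\log n / t_n} > 1$ for $n > n_{\min}$, where $n_{\min}$ is an explicit constant 
(its closed form is only partly legible in the source: $\exp\{3 \times 2^{-3/2} c^{3/2} (2 - c)^{-1} \cdots\}$). Then, 
for $t \le t_n$ and $n > n_{\min}$, the event $\{\delta(\Delta_t) \le c_1 \sqrt{\log n / t}\}$ is always verified. To conclude, we have 

$$P_{(\alpha,\beta)}\big((\alpha,\beta) \in \Delta_t,\ \delta(\Delta_t) \le c_1 \sqrt{\log n / t}\big) = 1 - P\big(\delta(\Delta_t) > c_1 \sqrt{\log n / t}\big) - P_{(\alpha,\beta)}\big((\alpha,\beta) \notin \Delta_t,\ \delta(\Delta_t) \le c_1 \sqrt{\log n / t}\big) \ge 1 - 6 \exp\{-\tfrac{1}{3} \log(n)\}.$$ 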
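
To make the objects in this appendix concrete, here is a minimal sketch of a Hoeffding-style confidence rectangle built from transition counts. Several points are assumptions rather than the paper's definitions: the trajectory is taken to be fully observed, state 0 means busy and state 1 means idle, and the half-width $\sqrt{\log n / (6N)}$ is chosen only because Hoeffding's inequality then gives a failure probability of $2\exp\{-\tfrac{1}{3}\log n\}$ per parameter for a fixed number N of samples. The function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def confidence_rectangle(states: Sequence[int], n: int) -> Tuple[Interval, Interval]:
    """Return confidence intervals for (alpha, beta) from a 0/1 trajectory.

    alpha is estimated from transitions leaving state 0 (busy), beta from
    transitions leaving state 1 (idle). n is the time horizon used in the
    assumed half-width sqrt(log(n) / (6 * count)).
    """
    counts = {0: 0, 1: 0}    # number of transitions leaving each state
    to_idle = {0: 0, 1: 0}   # how many of them land in the idle state
    for prev, nxt in zip(states, states[1:]):
        counts[prev] += 1
        to_idle[prev] += nxt
    intervals = []
    for s in (0, 1):
        if counts[s] == 0:
            intervals.append((0.0, 1.0))  # no information yet on this parameter
            continue
        estimate = to_idle[s] / counts[s]
        half = math.sqrt(math.log(n) / (6.0 * counts[s]))
        intervals.append((max(0.0, estimate - half), min(1.0, estimate + half)))
    return intervals[0], intervals[1]
```

The rectangle shrinks only as fast as the less visited state accumulates transitions, which is the role played by the counts $N_t^0$ and $N_t^1$ in the argument above.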